-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Implement dynamic gang header sizes #17004
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
module/zcommon/zfeature_common.c
Outdated
zfeature_register(SPA_FEATURE_DYNAMIC_GANG_HEADER, | ||
"com.klarasystems:dynamic_gang_header", "dynamic_gang_header", | ||
"Support for dynamically sized gang headers", | ||
ZFEATURE_FLAG_ACTIVATE_ON_ENABLE, ZFEATURE_TYPE_BOOLEAN, NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I also thought about this gang block issue, it also immediately became obvious that it would require read-incompatible pool feature. I am very unhappy about it. It would require updates to at least GRUB and FreeBSD loader to not break booting or careful compatibility settings from every user and distribution. The "activate on enable" makes it even worse, while I bet in many cases it could be avoided, for example for any pool with at least one ashift=9
vdev.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ACTIVATE_ON_ENABLE can certainly be avoided. If nothing else, we could have a check in zio_write_gang_block
that only activates it if we get allocations back that let us use a header of size > 512. In theory, we could modify the feature refcount for every single gang block created and freed, but that feels like a bit much, since each time you need to create a dmu_tx_t, assign it to the io's txg, etc. Doing it once you actually write a new-style gang block for the first time and never deactivating it should mean that for most pools, it never gets activated, and for those with vdevs with 512 byte sectors, it also wouldn't get activated.
read-incompatible, on the other hand, is definitely unavoidable. For now at least, people can avoid enabling the feature on their root pools, so the bootloader issue isn't as pressing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"avoid activating the feature" means never running zpool upgrade
for existing pools, or setting right compatibility
property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out a slightly bigger hammer than expected is needed here, so we definitely don't want to update this for every single gang header. It turns out that 1) I forgot that this featureflag needed the MOS flag and 2) MOS features basically have to be updated from syncing context. This means we have to spin out a synctask to do the job for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "activate on enabled" is likely risky here, because that would mean someone without upgraded bootcode would work fine until one day they birthed a gang block, and now it doesn't boot.
Explicitly requiring a zpool upgrade
to enable the feature feels far safer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"avoid activating the feature" means never running
zpool upgrade
for existing pools, or setting rightcompatibility
property on creation or at least before upgrading. At this time we don't even support disabling features that were enabled by mistake and that are not active. We should implement that at some point BTW. This might be a good motivation, if we accept the incompatibility.
I think this is possible with zhack
, but doing it as a real feature (zpool set feature@name=disabled $pool
) does seem useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the latest version the feature isn't ACTIVATE_ON_ENABLE anymore. In addition, even when it was, you would still need a zpool upgrade
to enable the feature before it would get enabled (and therefore activated), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, although most people blindly zpool upgrade
2ff3989
to
9843def
Compare
Here's a prototype of the changes to do variable leaf size allocations and use those for gang allocations. It has three primary parts. The first is changes to the allocation code to support allocation ranges, instead of specific sizes. The second is the concept of a "preallocated" write zio, which uses the normal write code path but skips parts of it since the allocation has already been done. The third part is changes to the gang creation code to take advantage of both of the above. I also added some tests for gang related code, since the test suite doesn't have that much in the way of those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach to variable size allocation, though code may use some polish. But in pre-allocated writes you've lost me. I haven't looked very close yet, but if we allocate space upfront, I wonder why it can't be implemented with existing rewrite zios?
More comments below, but I also wondering if we could implement variable size allocations first within existing gang blocks without introducing the new pool feature. I think it could help significantly by itself without breaking compatibility.
module/zfs/zio.c
Outdated
zio->io_bp = pio->io_bp; | ||
*zio->io_bp = zio->io_bp_orig = *bp; | ||
|
||
if (zio->io_abd != NULL) | ||
abd_free(zio->io_abd); | ||
zio->io_orig_abd = zio->io_abd = pio->io_abd; | ||
|
||
zio->io_size = zio->io_orig_size = allocated_size; | ||
zio->io_lsize = allocated_size; | ||
zio->io_done = NULL; | ||
zio->io_orig_pipeline = zio->io_pipeline = | ||
(zio->io_pipeline & ~ZIO_GANG_STAGES) | | ||
ZIO_STAGE_WRITE_COMPRESS | ZIO_STAGE_DVA_ALLOCATE; | ||
zio->io_flags |= ZIO_FLAG_PREALLOCATED; | ||
zp.zp_type = zp.zp_storage_type = pio->io_prop.zp_type; | ||
zp.zp_level = pio->io_prop.zp_level; | ||
zio->io_prop = zp; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are gang blocks with only one child illegal? Because this mess with live zio conversion from rewrite to regular write IMHO should be. Alternatively I think could be cleaner to replace the rewrite zio pipeline with ZIO_INTERLOCK_STAGES and let it quietly complete without doing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One-child gang blocks aren't illegal, but they are pretty gross. I didn't think this code was too bad, but I agree it's definitely not very clean. I'm fine stripping that out if people don't think it has value. I think just replacing it with the INTERLOCK stages will probably break the throttle accounting, but that will all change once your updated space accounting lands, so you may have a better idea about than I do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about this more, it might actually work fine with a few tweaks. Let me try it out and see.
Rather than implementing disabling features, which is a little tricky, I am playing with a lighter-weight solution here. This PR adds the idea of a feature that isn't enabled by upgrade by default. You can still enable it manually, and if it's in your compat file it will be enabled. There are some quirks here; I'm not sure exactly how to handle the code in upgrade that prints all un-enabled features; should it display this one, even though it won't be added by upgrade? Should it skip this one, even though it isn't enabled? For now, I left it as the former; there are some upgrade tests that had to be tweaked to work with this approach. I'm curious if anyone has feelings about this idea, or preferences for what the behavior should be for listing unenabled features. |
f85a90b
to
7a9513d
Compare
This is a rebased version of the changes; the first commit is the entirety of #17111 squashed down. I haven't run the tests or anything, just trying to keep this from getting too far out of sync with master. |
re:
Does the feature activation refcount stuff track the TXG of when a feature was activated? Would we be able to use the birth_txg of the gang block to tell if it should be 512b vs minimum asize? or does that involve looking inside the block we are trying to read? |
I believe we only track the TXG at which the feature was enabled. Because we use the small gang header unless we can't fit the necessary leaves, we can't just rely on that. A gang block created after the feature was enabled might use the small size, in order to avoid activating the feature unnecessarily. If we had activation_txg as well, we would also need to tweak the code to always use the larger header once the feature is active. Not a big change, but worth noting. |
7dafb0d
to
4fe393c
Compare
ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
tests/zfs-tests/tests/functional/gang_blocks/gang_blocks_001_pos.ksh
Outdated
Show resolved
Hide resolved
Signed-off-by: Paul Dagnelie <[email protected]>
One other idea would be to have "creation-only feature flags" that are only enabled at pool creation time. You could never upgrade to these features on existing pools, nor would they be listed on the "zpool upgrade" list. The benefit is we could safely enable it by default (on new pools), and it removes the need for "optional feature upgrades", and might simplify some of the codepaths. Sure, you wouldn't be able to upgrade existing pools, but that's an ephemeral cost. Over time, the fact that it's on by default for 100% of new pools will make up for it. One other question - does the code work in such a way that we will always know if |
My worry for this case would be that someone creates a boot pool with the new feature while unknowingly using an old bootloader that doesn't support it yet. It usually wouldn't be a huge issue (they'll find out when they boot the system onto the root pool for the first time), but it's an irritation. It is also valuable to be able to upgrade existing pools, since there are probably plenty of people out there with 4k pools who would like to use larger gang block headers without having to recreate their setups. That said, creation-only flags are probably simpler than this design. If people feel strongly about it I'm not opposed to switching.
I believe the way this works is that the label config contains a features_for_read list that stores all the label features the system needs to support in order to import the pool. During I think this is a general problem that had to be solved for any read-incompatible mos-required feature (e.g. hole birth, embedded BPs, device removal, draid, etc), so there shouldn't be anything special here. |
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Motivation and Context
ZFS gang block headers are currently fixed at 512 bytes. This is increasingly wasteful in the era of larger disk sector sizes. This PR allows any size allocation to work as a gang header. It also contains supporting changes to ZDB to make gang headers easier to work with.
Description
The ZDB changes are first; basically, there are just some small tweaks to make it easier to work with gang blocks. First, the compact blkptr printer now notes which DVAs have the gang bit set. There is also a fix of a bug that's been around since 2009; ZDB gang block header printing has been broken since then. The problem is that if you do a
zio_read
of a BP with the gang bit set, you don't get back the header, you get back the underlying data. The fix is to just not set the gang bit.The way dynamic sized gang headers work is that the amount of space we allow ourselves to use for the gang header is equal to the size of the smallest possible allocation on the vdevs we got back when we allocated it. This is necessary to work around the fact that the ASIZE for a gang BP isn't the allocated size of the gang header, but of the entire tree under it. Because of that fact, when reading, claiming, or freeing, the allocated size of a gang header must be able to be determined from no other information than the vdev it was allocated on. We reuse the existing
vdev_gang_header_asize
for this. We take this minimum space, and pack the block full of blkptrs, up to a tail with an embedded checksum. This allows us to store many more gang children per header, leading to much shallower gang trees if we're forced into intense ganging.One important wrinkle is that if the pool we're using has old gang headers on it, they may be smaller than the smallest allocation of the relevant vdevs. We have a workaround for this in
zio_checksum_error
, which will now try the checksum again in the case of a gang block of size above 512 bytes, with the size reduced to 512. This should not pose a significant performance problem, since 1) calculating a checksum for a block that small doesn't take very long, and 2) hopefully the number of gang blocks on systems is not too high. This also only applies to systems that have both old gang headers and the feature enabled.Much of the remaining changes are just tweaking the gang tree orchestration logic to work with a variable number of gang block pointers. The final interesting change is to the logic that determines the size of the gang children. When only 3 gang children were possible, it didn't really matter what the sector size was; you were going to be allocating at most 3 of them, so the space waste was limited. With large sector sizes, however, it could get expensive: A 128k allocation that gangs on a system with 64k sectors will, without changes, consume 256 64k sectors, with 512 bytes of data each, wasting 128x as much space as the original allocation. The fix is to use the spa_min_ashift rather than the MINBLOCKSIZE when calculating the lsize.
How Has This Been Tested?
In addition to the ZTS and zloop, extensive manual tests were performed, verifying that 1) the new ZDB functionality works, and 2) that larger gang headers are correctly used and allocated when applicable. Mixed ashift testing occured, as well as the full compatibility matrix of old pools, new pools without the feature enabled running against old and new code.
Types of changes
Checklist:
Signed-off-by
.